Unsupervised Classification with Stochastic Complexity
Abstract
1 Problem Statement

In unsupervised classification, we are given a collection of samples and must label them to show their class membership, without knowing anything about the underlying data-generating machinery, not even the number of classes. That is, we are given some sequence of observed objects and must assign each object to one of a number of classes C = {C_1, ..., C_c} in an "optimal" fashion. A solution to this problem, then, consists of (i) a measure of the quality of a given classification, and (ii) an algorithm for classifying a given set of objects.

Two popular definitions of optimality are predictive error and data reduction. "Predictive error" measures the ability of the classifier to correctly predict the likelihood and class membership of future samples. According to this criterion, the goal of classification is to efficiently discretize the continuous k-dimensional space of measurements in a manner that preserves the probability density of that space. A typical application is vector quantization of speech signals. "Data reduction" measures the classifier's ability to offer insight into large collections of high-dimensional data. According to this criterion, the goal of classification is to reduce an infeasibly large collection of data to a smaller, more feasible set in a manner that preserves the underlying structure of the original collection. A typical application is visualization of census data.

Unsupervised classification, sometimes called "clustering," is an essential tool in data analysis and pre-theoretical scientific inquiry. It has important applications in many fields of study, including ...

The problem of unsupervised classification presents us with two profound difficulties. The first difficulty is that the observations are not labeled with their classes, and therefore we do not know the correct number of underlying classes. If our classifier postulates too few or too many classes, then it distorts the density of the measurement space. Its ability to predict future data will suffer, as will its ability to offer insight into the global structure of the observations. The second difficulty inherent to unsupervised classification is one of computational complexity. In order to preserve the probability density of the measurement space, the number of classes and ...
Similar resources
Gaussian Processes for Big Data through Stochastic Variational Inference
Gaussian processes [GP 10] are perhaps the dominant approach for inference on functions. They underpin a range of algorithms for regression, classification and unsupervised learning. Unfortunately, exact inference in a GP has complexity O(n^3) with storage demands of O(n^2) and this hinders application of these models for ‘big data’. Various approximate techniques have been suggested [see e.g. 1, 1...
Learning Genetic Population Structures Using Minimization of Stochastic Complexity
Considerable research efforts have been devoted to probabilistic modeling of genetic population structures within the past decade. In particular, a wide spectrum of Bayesian models have been proposed for unlinked molecular marker data from diploid organisms. Here we derive a theoretical framework for learning genetic population structure of a haploid organism from bi-allelic markers for which p...
Deep Unsupervised Domain Adaptation for Image Classification via Low Rank Representation Learning
Domain adaptation is a powerful technique when a large amount of labeled data with similar attributes is available in different domains. In real-world applications, there is a huge amount of data, but most of it is unlabeled. Domain adaptation is effective in image classification, where it is expensive and time-consuming to obtain adequately labeled data. We propose a novel method named DALRRL, which consists of deep ...
Evaluating the Effectiveness of Supervised and Unsupervised Classification Methods in Monitoring Regs (Case Study: Jazmourian Reg)
Because of their mobility and their direct impact on residential areas and various development activities, ergs are of major importance in desert areas, so monitoring them is very important. Considering that supervised and unsupervised methods are among the most common methods for determining and monitoring land use, in this research, the accuracy...
Bayesian unsupervised classification framework based on stochastic partitions of data and a parallel search strategy
Advantages of statistical model-based unsupervised classification over heuristic alternatives have been widely demonstrated in the scientific literature. However, the existing model-based approaches are often both conceptually and numerically unstable for large and complex data sets. Here we consider a Bayesian model-based method for unsupervised classification of discrete-valued vectors, that h...
Stochastic Unsupervised Learning on Unlabeled Data
In this paper, we introduce a stochastic unsupervised learning method that was used in the 2011 Unsupervised and Transfer Learning (UTL) challenge. This method is developed to preprocess the data that will be used in the subsequent classification problems. Specifically, it performs K-means clustering on principal components instead of raw data to remove the impact of noisy/irrelevant/less-relev...